fix(subagent): make the system prompt a fixed trust boundary#2
Merged
Conversation
The sub-agent system prompt was parent-writable: delegate_tasks exposed a `system` field that REPLACED the prompt wholesale (dropping the SAFETY block), and buildSubagentPrompt embedded the raw, parent-supplied goal text directly into the system message. Since goal/context can carry text the parent ingested from untrusted sources (fetched pages, MCP output, files), a prompt-injection payload could redefine the sub-agent's identity or strip its anti-injection rules. Harden the boundary: - The sub-agent system prompt is now a FIXED, code-defined constant. Nothing the parent supplies is ever spliced into it. Strengthen its SAFETY block and state that the request is data, not identity. - All parent guidance moves into the user REQUEST via buildSubagentRequest() (goal + guidance + context). Remove buildSubagentPrompt and the taskSystem / ODEK_SYSTEM overrides for sub-agents. - Rename the delegate_tasks `system` field to `guidance` (how to approach the task — delivered in the request), and re-describe it. - When trust_level=untrusted, wrap the request body in an <untrusted_input> fence (defense-in-depth alongside the existing applySubagentTrust clamp). Tests: drop the obsolete buildSubagentPrompt persona tests; add subagent_prompt_isolation_test.go asserting the system prompt is unaffected by (even hostile) parent input, that the request carries goal/guidance/context, and that untrusted tasks are fenced. Update schema + e2e tests (system->guidance). Docs: rewrite SUBAGENTS.md "system prompt & request" section and update SECURITY.md §7. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Make the sub-agent system prompt a fixed trust boundary the parent agent cannot write to. All parent-supplied steering moves into the sub-agent's request, where the SAFETY rules frame it as data — not as identity-defining instructions.
The problem
delegate_taskslet the parent set a per-tasksystemfield that replaced the sub-agent's system prompt wholesale (dropping the SAFETY/anti-injection block), andbuildSubagentPromptembedded the raw, parent-supplied goal text directly into the system message. Becausegoal/contextcan carry text the parent ingested from untrusted sources (fetched pages, MCP output, files), a prompt-injection payload could redefine a sub-agent's identity or strip its safety rules.The fix
subagentSystem). Nothing the parent supplies is ever spliced into it. Its SAFETY block is strengthened and explicitly states the request is data, not identity.goal+guidance+contextare assembled into the user message bybuildSubagentRequest(). RemovedbuildSubagentPromptand thetaskSystem/ODEK_SYSTEMoverrides for sub-agents.system→guidance. Thedelegate_tasksfield is renamed and re-described as how to approach the task (request-level), not a system prompt.trust_level: "untrusted", the request body is wrapped in an<untrusted_input>fence — defense-in-depth alongside the existingapplySubagentTrustpermission clamp.Tests
subagent_prompt_isolation_test.go: the system prompt is unaffected by (even hostile) parent input; the request carries goal/guidance/context; untrusted tasks are fenced; trusted ones are not.buildSubagentPromptpersona tests; updated the tool-schema test (system→guidance, assertssystemis absent) and the e2e tests.go build ./...,go vet,gofmt, and thecmd/odeksuite pass.Docs
docs/SUBAGENTS.md: replaced "Dynamic system prompts" with "System prompt & request (trust boundary)".docs/SECURITY.md§7: documents the fixed-prompt boundary and what changed.Behavior change / compat
The
systemfield ondelegate_tasksis removed (replaced byguidance), andODEK_SYSTEM/configsystemno longer apply to sub-agents. The tool schema is regenerated each run, so there are no external consumers; the parent model is steered via the new field description. The dynamic persona auto-selection is intentionally dropped — approach is now expressed viaguidance.🤖 Generated with Claude Code